Abstract

Hashing-based methods seek compact and efficient binary codes that preserve the
neighborhood structure of the original data space. In most existing hashing methods,
an image is first encoded as a vector of hand-crafted visual features, followed
by a hash projection and a quantization step that produce the compact binary vector.
Most hand-crafted features encode only low-level information about the input, so the
features may not preserve the semantic similarities of image pairs. Meanwhile, the
hash function learning process is independent of the feature representation,
so the features may not be optimal for the hash projection. In this paper, we propose
a supervised hashing method based on a well-designed deep convolutional
neural network, which learns hash codes and compact representations of the
data simultaneously. The proposed model learns the binary codes by adding a compact
sigmoid layer before the loss layer. Experiments on several image data sets
show that the proposed model outperforms other state-of-the-art methods.


1 Introduction 

Similarity search generally involves a large-scale collection of data (e.g., images, videos, documents)
that are represented as points in a high-dimensional space. Given a query, we are required to find the
most similar (top-k nearest) instances. This is a central task for search engines, as well as for
areas such as data compression and pattern recognition. It has various real-world applications,
for example scene completion [1], image retrieval, and plagiarism analysis [2].

In most existing hashing methods, an input is first projected into a low-dimensional subspace,
followed by a quantization step that produces the compact binary vector. Locality Sensitive Hashing
(LSH) and its extensions [3,4,5,6], which are based on randomized projections, are among the most
widely employed hashing methods in industrial practice for approximate nearest neighbor (ANN) search.
The main advantage of this technique is that random projections provably preserve the similarity
of pairs in the original data space, while the random initialization of the projection matrix
requires no extra computation. This makes LSH suitable for large-scale ANN tasks. However, higher
precision generally requires long codes, which lead to low recall and higher storage cost.
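
The random-projection scheme described above can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited implementations: the function names `lsh_fit`, `lsh_hash`, and `hamming`, the data dimensions, and the random seeds are all illustrative assumptions.

```python
import numpy as np

def lsh_fit(dim, n_bits, seed=0):
    """Draw one Gaussian random direction per bit (no training data needed)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim, n_bits))

def lsh_hash(X, L):
    """Project onto the random directions and keep the sign as a 0/1 bit."""
    return (X @ L > 0).astype(np.uint8)

def hamming(code, codes):
    """Hamming distance between one code and every row of `codes`."""
    return np.count_nonzero(code != codes, axis=1)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 64))              # toy data, assumed for illustration
X[1] = X[0] + 0.01 * rng.standard_normal(64)    # make X[1] a near-duplicate of X[0]

L = lsh_fit(dim=64, n_bits=32)
codes = lsh_hash(X, L)
d = hamming(codes[0], codes)                    # distances from the first point
```

In this toy setting the near-duplicate `X[1]` should land much closer to `X[0]` in Hamming space than an unrelated point does, which is the similarity-preserving behaviour that makes the scheme attractive despite using no training data.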

In contrast to the data-independent hashing framework employed in LSH-related methods, most
recent research focuses on data-dependent hashing, which learns the projection function from training
data. Semantic hashing [7] uses a deep graphical model to learn the hash function by forcing
a deep layer to be small. Anchor graph hashing [8] and spectral hashing [9] use spectral graph
partitioning for hashing, with the graph constructed from the similarity relationships in the data.
Multidimensional spectral hashing [10] introduces a new formulation that seeks to reconstruct the
affinity between data points rather than the distance. Binary reconstructive embedding [11] learns
the hash function by explicitly minimizing the reconstruction error between the original distances
and the Hamming distances of the corresponding binary embeddings. Minimal loss hashing [12]
formulates the hashing problem as a structured prediction problem with latent variables and a
hinge-like loss function. PCA-ITQ (iterative quantization) [13,14] is a recent data-dependent method
that outperforms most other state-of-the-art approaches. It finds an orthogonal rotation matrix
that refines the initial projection matrix learned by principal component analysis (PCA), so that the
quantization error of mapping the data to the vertices of the binary hypercube is minimized.
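
The PCA-plus-rotation idea can be sketched in NumPy as an alternating minimization. This is only a sketch in the spirit of the description above, not the reference implementation of [13,14]: the names `itq_train` and `itq_encode`, the iteration count, and the toy data are assumptions. The rotation update uses the standard orthogonal Procrustes solution via an SVD.

```python
import numpy as np

def itq_train(X, n_bits, n_iter=50, seed=0):
    """Learn a PCA projection W and an orthogonal rotation R (ITQ-style sketch)."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA: keep the n_bits directions of largest variance.
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    V = Xc @ W
    # Start from a random orthogonal matrix.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    errors = []
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)        # fix R, update the +/-1 codes
        errors.append(float(np.sum((B - V @ R) ** 2)))
        U, _, Vt = np.linalg.svd(B.T @ V)          # fix the codes, update R
        R = Vt.T @ U.T                             # orthogonal Procrustes solution
    return mean, W, R, errors

def itq_encode(X, mean, W, R):
    """Binary 0/1 codes for new data under the learned projection and rotation."""
    return ((X - mean) @ W @ R >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32)) @ rng.standard_normal((32, 32))  # toy data
mean, W, R, errors = itq_train(X, n_bits=8)
codes = itq_encode(X, mean, W, R)
```

Each of the two alternating steps can only lower (or keep) the quantization error recorded in `errors`, which is why the rotation refines the plain PCA projection.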

All of the hashing methods mentioned above learn hash functions on top of hand-crafted
visual descriptors (e.g., GIST [15], BoF [16,17]). However, hand-crafted features extract only
a low-level representation of the data, which may not preserve the semantic
similarities of image pairs. In addition, the hash function learning procedure is independent
of the feature extraction step. Hence, hash code learning cannot give feedback to the feature
representation learning step, and vice versa.

In this paper, we introduce a supervised hashing method based on a well-designed deep convolutional
neural network, which combines feature learning and hash function learning. We
compare our model with multiple hashing approaches. The results show that our method
matches or exceeds state-of-the-art performance for image retrieval.


2 Related Work 


Convolutional neural networks (CNNs) [18,19] have demonstrated their power in several visual
recognition fields and exceed human-level performance in many tasks, including recognizing traffic
signs [20], faces [21,22], and handwritten digits [20,23]. Meanwhile, approaches based on deep
convolutional neural networks have recently improved substantially on the state of the art in
large-scale image classification [24,25], object detection [26,27], and many other tasks [28,29].

Conventional hand-crafted visual descriptors are designed by human engineers in an unsupervised
fashion for a specific task. In contrast, deep convolutional neural networks encode
images into multiple levels of representation. With such representations, we can discover
the complex structures hidden in high-dimensional data. The key to the success of such deep
architectures is their capacity for representation learning. For classification tasks, the higher
layers of representation preserve the aspects of the input that are important for discrimination and
suppress irrelevant variations. In this work, motivated by this representation learning capability,
we use deep CNNs to learn image features automatically instead of using hand-crafted features
(e.g., GIST, BoF).

Following the success of machine learning on many tasks, numerous models have
been introduced to hashing applications. Semantic hashing [31] introduces a probabilistic model to
learn the marginal distribution over input vectors. The assumptions used in semantic hashing fit the
constraints in Equation (1) well. However, the semantic hashing network is complex and difficult to
train. CNN hashing (CNNH) [32] is a two-stage hashing method for learning optimal hash codes. In the
first stage, approximate hash codes are learned by decomposing the similarity matrix $S$ into the
product form $S \approx \frac{1}{q} H H^{\top}$, where $q$ is the code length. The k-th row of $H$ is
regarded as the approximate target hash code. In the second stage, the learned hash codes are used
as supervision to learn the hash function. This two-stage framework leads to good performance.
However, the matrix decomposition limits the scale of the application. Moreover, the learned image
representation cannot give feedback for learning better approximate hash codes. We propose a method
that combines feature learning and hash code learning. This end-to-end architecture improves on
previous state-of-the-art supervised and unsupervised hashing methods.


3 The Proposed Model 

3.1 Hash Statement 

Generally speaking, a good code for hashing satisfies three conditions [30]: (1) similar
pairs in the data space are projected to similar binary codewords in Hamming space; (2) only a small
number of bits is needed to encode each sample; and (3) projecting an input requires little
computation. For a given data set $\{x_1, x_2, \ldots, x_n\}$ with $x_i \in \mathbb{R}^d$, let
$\{y_i\}_{i=1}^{n}$ with $y_i \in \{0,1\}^m$ be the binary codes of the inputs. In general, we
assume that different bits are independent of each other, that is,
$y_i = [h_1(x_i), h_2(x_i), \ldots, h_m(x_i)]^{\top}$ with $m$ independent binary hash functions
$\{h_k(\cdot)\}_{k=1}^{m}$. We require that each bit has a 50% chance of being one or zero. Since our
goal is to minimize the average Hamming distance between similar pairs, we obtain the following
problem:

$$
\begin{aligned}
\text{minimize:} \quad & \sum_{ij} W_{ij} \, \| y_i - y_j \|^2 \\
\text{s.t.} \quad & y_i \in \{0,1\}^m \\
& \sum_i y_i = 0 \\
& \frac{1}{n} \sum_i y_i y_i^{\top} = I
\end{aligned}
\tag{1}
$$

where $W_{ij}$ denotes the similarity between $x_i$ and $x_j$, the constraint $\sum_i y_i = 0$
corresponds to the 50% probability assumption, and the constraint
$\frac{1}{n} \sum_i y_i y_i^{\top} = I$ corresponds to the assumption that the bits are independent.
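
To make the objective and the two constraints of problem (1) concrete, the following NumPy sketch evaluates them on a tiny hand-made example. The function names and the toy similarity matrix are illustrative assumptions. The balance and decorrelation checks are computed after mapping the 0/1 codes to -1/+1, which assumes the constraints are meant for ±1-valued codes, as in spectral hashing [9].

```python
import numpy as np

def hashing_objective(Y, W):
    """sum_ij W_ij * ||y_i - y_j||^2 for 0/1 codes Y of shape (n, m)."""
    Y = Y.astype(float)
    sq = (Y ** 2).sum(axis=1)
    # Squared Euclidean distance between binary codes equals their Hamming distance.
    D = sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T)
    return float((W * D).sum())

def constraint_residuals(Y):
    """Deviation from the balance and decorrelation constraints (zero when satisfied)."""
    S = 2.0 * Y - 1.0                      # assumption: evaluate constraints on +/-1 codes
    balance = S.sum(axis=0)                # each bit is on for half of the samples
    decorrelation = S.T @ S / len(S) - np.eye(S.shape[1])
    return balance, decorrelation

# Four samples, two bits: every bit pattern appears exactly once.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0                    # only samples 0 and 1 are marked similar

objective = hashing_objective(Y, W)        # the similar pair differs in one bit
balance, decorrelation = constraint_residuals(Y)
```

Here the only similar pair differs in a single bit and is counted twice (once as (i, j) and once as (j, i)), and both residuals vanish because every bit is balanced and the two bits are uncorrelated.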



Figure 1: Illustration of the end-to-end deep hashing learning network. 

In the following sections, we describe the model in detail. Figure 1 shows an example of the pipeline
of the deep convolutional neural network: a linear hash projection layer is followed by a sigmoid
quantization layer, and the network is trained using backpropagation.

Let $\mathcal{X}$ denote the image space, with $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$. The goal of
hash learning for images is to find a projection $\mathcal{H} : \mathcal{X} \rightarrow \{0,1\}^m$.
Because it is NP-hard to compute the best binary functions $h_k(\cdot)$ for the image set
$\mathcal{X}$ [9], hashing methods adopt a two-stage strategy to learn $h_k(\cdot)$: a projection
stage with $m$ real-valued hash functions $\{h_k(x)\}_{k=1}^{m}$, and a quantization stage that
thresholds the real values to binary.

3.2 Model Formulation 

In standard locality sensitive hashing, each hash function $h_k$ is generated independently by drawing
a random vector $l_k$ from a Gaussian distribution with zero mean and identity covariance. The
hash function can then be expressed as $h_k(x) = \mathrm{sign}(l_k^{\top} x)$. In our approach, the
input image is first mapped to a feature vector by multiple convolution and pooling operations, and
each hash function is then computed as

$$
h_k(x) = \mathrm{sigmoid}\left( W_k^{\top} \, \mathrm{CNN}(x) \right), \quad k = 1, \ldots, m, \tag{2}
$$

where $m$ denotes the number of hash functions, $\mathrm{CNN}(x)$ denotes the feature extraction
applied to the input image, and $W_k$ is the projection vector of the $k$-th hash function. Each hash
function $h_k(\cdot)$ is learned independently by applying a linear mapping to the same feature
representation layer.

The sigmoid function is a special case of the logistic function and has an "S" shape. Because of
its easily calculated derivative and its physiological plausibility, the sigmoid function was widely
used in neural networks until ReLU (rectified linear units) and its extensions became popular. The
sigmoid function compresses the signal to the interval (0, 1). Experiments in our work show that the
outputs of the sigmoid layer mostly fall in the tails of the sigmoid curve, where the values are
close to zero or one.

CNNs achieve great success on image classification tasks. The major drawback of many feature learning
architectures is their complexity: these algorithms usually require careful selection of multiple
hyperparameters, such as learning rate, momentum, and weight decay, and a good initialization of the
filters is also key to good performance. In this paper, we adopt a simple CNN framework to meet the
needs of fast nearest neighbor search.
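
Equation (2) and the subsequent quantization step can be sketched in NumPy as follows. This is a hedged illustration rather than the trained model: the random `features` array merely stands in for the output of CNN(x), the weight scale is arbitrary, and the 0.5 threshold in `binarize` is an assumption, since the text only states that the real values are thresholded to binary.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hash_layer(features, W):
    """Relaxed codes h_k(x) = sigmoid(W_k^T CNN(x)), one column of W per bit."""
    return sigmoid(features @ W)

def binarize(h, threshold=0.5):
    """Quantize the relaxed codes to 0/1 bits (threshold value is an assumption)."""
    return (h >= threshold).astype(np.uint8)

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 64))     # stand-in for CNN(x) of 8 images
W = 0.1 * rng.standard_normal((64, 16))     # m = 16 projection vectors W_k

h = hash_layer(features, W)                 # real-valued outputs in (0, 1)
codes = binarize(h)                         # 16-bit binary code per image
```

Because the sigmoid is monotonic and equals 0.5 at zero, thresholding its output at 0.5 yields the same bits as taking the sign of the linear projection, so the sigmoid layer acts as a differentiable surrogate for the sign function during training.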

4 Experiments 

4.1 Data Sets 

To evaluate the proposed method, we use two benchmark datasets with different kinds of images,
MNIST [33]¹ and CIFAR-10 [35]². MNIST consists of 70K 28×28 grayscale images
of handwritten digits from zero to nine, with a training set of 60,000 examples and a test set of
10,000 examples. CIFAR-10 consists of 60,000 32×32 colour images in 10
classes of natural objects, with 6,000 images per class. There are 50,000 training images and 10,000
test images.

4.2 Baselines 

In this paper, the four most representative unsupervised methods (PCA-ITQ, LSH, SH, and PCAH) and two
supervised methods (KSH and BRE) are chosen as baselines for the proposed hashing method.

We randomly choose 1,000 images as the query set. For the unsupervised methods, all of the remaining
images are used as training samples; for the supervised methods, we use the original training set as
training samples. For the proposed method, we directly use the image pixels as input. For the baseline
methods, we follow the common setting for image representations, using a 784-dimensional grayscale
vector for each MNIST image and a 512-dimensional GIST vector for each CIFAR-10 image. We mean-center
the data and normalize each feature vector to unit norm. We adopt the evaluation schemes widely used
in image retrieval, including mean average precision (mAP), precision-recall curves, precision
curves within a given Hamming distance, and precision curves w.r.t. the number of top returned images.
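
As an illustration of the first of these metrics, the following sketch computes mean average precision under Hamming ranking, treating a database item as relevant when it shares the query's class label. It is a generic formulation, not the evaluation script used for the reported numbers: the function names, the stable tie-breaking of equal distances, and the toy codes are assumptions.

```python
import numpy as np

def average_precision(relevant_sorted):
    """AP of one ranked list, given a boolean relevance flag per rank."""
    rel = np.asarray(relevant_sorted, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(q_codes, q_labels, db_codes, db_labels):
    """Rank the database by Hamming distance to each query and average the APs."""
    aps = []
    for code, label in zip(q_codes, q_labels):
        dist = np.count_nonzero(db_codes != code, axis=1)
        order = np.argsort(dist, kind="stable")   # ties keep database order (assumption)
        aps.append(average_precision(db_labels[order] == label))
    return float(np.mean(aps))

# Toy example: codes that separate the two classes perfectly.
db_codes = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=np.uint8)
db_labels = np.array([0, 0, 1, 1])
q_codes = np.array([[0, 0], [1, 1]], dtype=np.uint8)
q_labels = np.array([0, 1])

toy_map = mean_average_precision(q_codes, q_labels, db_codes, db_labels)
ap_example = average_precision([True, False, True])   # (1/1 + 2/3) / 2
```

How ties in Hamming distance are broken can change the measured mAP for short codes, so the tie-breaking rule is worth stating when results from different implementations are compared.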

4.3 Model Configuration 

We implement the proposed method with the open-source Caffe [37] framework. We use 32, 32, and
64 filters of size 5×5 in the first, second, and third convolutional layers, respectively, each
followed by a ReLU activation function. The hash mapping layer is placed on top of the third pooling
layer and is followed by a compression sigmoid layer.
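
To give a feel for the size of the feature vector that feeds the hash mapping layer, the sketch below walks the spatial dimensions through three convolution and pooling stages. Only the filter counts and the 5×5 kernel come from the text; the padding of 2, the 3×3 pooling window with stride 2, and the helper names are assumptions made for illustration. Pooling output sizes are rounded up, which is how Caffe computes them.

```python
import math

def conv_out(size, kernel, pad=0, stride=1):
    """Spatial output size of a convolution (rounded down)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Spatial output size of a pooling layer (rounded up, as in Caffe)."""
    return math.ceil((size - kernel) / stride) + 1

def feature_dim(input_size, filters=(32, 32, 64), kernel=5,
                pad=2, pool_kernel=3, pool_stride=2):
    """Length of the flattened feature vector after the last pooling layer.

    pad, pool_kernel and pool_stride are assumed values, not taken from the text.
    """
    size = input_size
    for _ in filters:
        size = conv_out(size, kernel, pad)          # 5x5 convolution (ReLU keeps the size)
        size = pool_out(size, pool_kernel, pool_stride)
    return filters[-1] * size * size

cifar_dim = feature_dim(32)    # 32x32 CIFAR-10 input
mnist_dim = feature_dim(28)    # 28x28 MNIST input
hash_weights_48 = 48 * cifar_dim   # weights of a 48-bit linear hash layer, excluding biases
```

Under these assumed settings the hash layer is a single small matrix multiplication on top of the convolutional features, which is consistent with the goal of keeping the projection cost low.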

Table 1: mAP on the MNIST and CIFAR-10 datasets w.r.t. different numbers of bits

            MNIST (mAP)                           CIFAR-10 (mAP)
method      16 bits  24 bits  32 bits  48 bits    16 bits  24 bits  32 bits  48 bits
LSH         0.250    0.284    0.310    0.430      0.298    0.344    0.331    0.389
SH          0.347    0.383    0.393    0.387      0.352    0.355    0.379    0.381
PCAH        0.351    0.344    0.332    0.309      0.291    0.280    0.272    0.261
PCA-ITQ     0.515    0.550    0.581    0.610      0.427    0.445    0.453    0.469
SKLSH       0.182    0.231    0.218    0.256      0.288    0.312    0.334    0.394
KSH         -        0.891    0.897    0.900      0.303    0.337    0.346    0.356
BRE         -        0.593    0.613    0.634      0.159    0.181    0.193    0.196
CNN+        -        0.975    0.971    0.975      0.465    0.521    0.521    0.532
Ours        0.992    0.993    0.995    0.993      0.714    0.718    0.736    0.728


4.4 Results Analysis 

Table 1 and Figures 2 and 3 compare the methods on the evaluation datasets in terms of mAP,
precision-recall curves, and two other evaluation curves. The results of all unsupervised methods are
obtained with the open-source implementations provided by their respective authors, and we directly
use the results of the supervised methods KSH and BRE reported in [32]. KSH needs extra time for
k-means learning, so on large-scale data its hash learning may be time-consuming. We also compare
the mAP results with CNNH: our method gains 0.2% over CNNH on MNIST. In particular, our model shows
an increase of 18% to 27% on CIFAR-10 over the state-of-the-art method. This substantially superior
performance verifies the effectiveness of our end-to-end framework.

¹ http://yann.lecun.com/exdb/mnist
² http://www.cs.toronto.edu/~kriz/cifar.html

Compared with the conventional methods, CNN-based methods achieve much better results, which
we attribute to the automatically learned image representation. As mentioned before, a good hash
code preserves similarity, uses few bits, and requires little computation. Hence, any
time-consuming computation should be avoided. In this work, we adopt a simple CNN to learn the
feature representation and the hash codes. A more complex model can improve performance,
but the cost of fast similarity search will increase as well.


Figure 2: Quantitative comparison results on MNIST. (a) Precision-recall curves with 48 bits. (b)
Precision curves w.r.t. the number of top returned images.

Figure 3: Quantitative comparison results on CIFAR-10. (a) Precision-recall curves with 48 bits. (b)
Precision curves w.r.t. the number of top returned images.


5 Conclusion 

In this paper, we proposed an end-to-end supervised method for image retrieval that simultaneously
learns a compact hash code and a good feature representation of images. The method places no
restriction on data scale and generates hash codes with little computation, and the model can be
accelerated with GPUs and multithreading. The proposed method learns the hash codes from image
labels. We use only a simple CNN model to learn the hash codes, and experiments show that the
retrieval results can be improved by a more powerful classification model. Even with such a simple
model, our method achieves substantial performance gains over state-of-the-art methods.